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Abstract 

In this paper, we develop new methods for estimating average treatment effects 
in observational studies, in settings with more than two treatment levels, assuming 
unconfoundedness given pre-treatment variables. We emphasize propensity score sub¬ 
classification and matching methods which have been among the most popular methods 
in the binary treatment literature. Whereas the literature has suggested that these par¬ 
ticular propensity-based methods do not naturally extend to the multi-level treatment 
case, we show, using the concept of weak unconfoundedness and the notion of the 
generalized propensity score, that adjusting for a scalar function of the pre-treatment 
variables removes all biases associated with observed pre-treatment variables. We apply 
the proposed methods to an analysis of the effect of treatments for fibromyalgia. We 
also carry out a simulation study to assess the finite sample performance of the methods 
relative to previously proposed methods. 

*shuyang@hsph.harvard.edu 
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1 Introduction 

There is an extensive theoretical and empirical literature on estimating average causal effects 
of binary treatments in observational studies based on the assumption of unconfoundedness 
or ignorable treatment assignment. Under this assumption differences in outcomes for units 
with different treatment levels, but the same values for pre-treatment variables, can be 
interpreted as estimates of causal effects. Much of the literature builds on the seminal paper 
by Rosenbaum & Rubin (1983) (RR83 from here on) which clarified the central role of 
the propensity score (the conditional probability of receiving the treatment given the pre¬ 
treatment variables or covariates) in analyses of causal effects in such settings, and which 
proposed a number of widely used estimators. See Imbens & Rubin (2015) for a textbook 
treatment. 

Although important for empirical practice, much less theoretical work has been done on 
the setting with more than two treatment levels (exceptions include Imbens, 2000; Robins, et 
ah, 2000; Lechner, 2001; Foster, 2003; Hirano & Imbens, 2004; Imai & Van Dyk, 2004; Cole 
& Frangakis, 2009; Cadarette et ah, 2010; Cattaneo, 2010; McCaffrey et ah, 2013; Rassen 
et ah, 2013). Because in settings with multi-level treatments there is no scalar function of 
the covariates that has all the properties that RR83 presents for the propensity score in the 
binary treatment case, it has been claimed that there is no natural analogue to matching 
and subclassification on the propensity score (Imbens, 2000; Lechner, 2001; Rassen et ah, 
2013). 

In the main contribution of the current paper we show that, contrary to these claims, 
the essence of the results in RR83 generalizes to the setting with multi-level treatments. 
In particular, we develop methods for matching and subclassification on scalar functions of 
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the covariates that are valid irrespective of the number of distinct levels for the treatment. 
The key insight is that we do not construct sets of units with the balancing property that 
within these sets the treatment level is independent of the covariates. Doing so would require 
adjusting for the vector of propensity scores with length equal to the number of treatment 
levels minus one. Instead we focus on estimating the average of the potential outcomes 
separately for each treatment level, which requires adjusting only for the probability of 
receiving that particular level of the treatment. This insight allows us to extend some of 
the most widely used methods for the binary treatment case to the multi-level treatment 
case without giving up the dimension reducing property of the propensity score. We provide 
some simulation evidence that demonstrates the relevance of concerns with the previously 
proposed estimators and the promise of the new methods. 

2 Set Up 

Following Rubin (1974) and RR83 we use the potential outcome set up, generalized to the 
case with more than two, unordered, levels for the treatment in Imbens (2000), Lechner 
(2001), Imai & Van Dyk (2004), and Cattaneo (2010). The treatment is denoted by hR G 
W = {1,... ,T}. In the standard binary treatment case T = 2, the two treatments are often 
labeled treatment and control. For each unit i there are T potential outcomes, one for each 
treatment level, denoted by VRtc), for w G W. Implicitly in this notation is the assumption 
that there is no interference between units and no versions of each treatment level (the 
stable-unit-treatment-value assumption, or sutva, Rubin, 1978). The observed outcome for 
unit i is the potential outcome corresponding to the treatment received; 

We also observe a vector-valued covariate or pre-treatment variable, denoted by X*. These 
pre-treatment variables are known a priori not to be affected by the treatment, typically 
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measured prior to the determination of the treatment level, and so there are no multiple 
versions of these covariates corresponding to the different levels of the treatment. These 
pretreatment variables may include hxed attributes of the units, or measurements prior to the 
treatment assignment that predict the outcome, for example prior health status. Although 
we do not stress this in the notation, there is implicitly a temporal aspect to the study with 
three stages: hrst, the pre-treatment variables are measured, or at least they assume their 
values prior to the assignment of the treatment; second, the treatment is assigned or selected; 
third, the outcome assumes its value and is measured. 

We assume the sequence (Wi, Xi, Yi(l),..., Yi(T)),..., (IdW, X^, 1^(1),..., Yn{T)) with 
the potential outcomes is i.i.d., so that the sequence of realized values (Wi, Xi, ..., (Wn, 

Xn,Y^^^) is also i.i.d. 

Following the literature (e.g., RR83), we focus on average treatment effects as the causal 
estimands. This is less restrictive than it may appear at hrst because we can hrst take 
transformations of the outcomes and pre-treatment variables. For the comparison between 
treatment levels w and w', the average ehect is 

t{w,w') = E[Yi{w') - Yi{w)]. (1) 

The expectation is taken with respect to the same population (called the target population, 
Frolich, 2004a) for diherent treatment-level pairs {w,w'). Some researchers, when analyzing 
data with multi-level treatments, have used conventional methods for comparing two treat¬ 
ment levels at a time. Often such analyses use only information on units exposed to one of 
those two treatment levels, which would lead to estimates of — | Wi G 

If the subpopulation of units with treatment levels w or w' is diherent in terms of poten¬ 
tial outcome distributions, these estimands are generally diherent from the t{w,w') dehned 
in (1), because the latter do not condition on hR G {w,w'}. As a result such binary- 
comparison analyses make it difficult to compare E[R(t(;') — Yi{w) \ Wi G and 
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¥J\Yi{w'') — Yiiw) I Wi G {w,w"}] because they refer to different populations. 

In this paper we mainly focus on the case where the different treatment levels are qualit¬ 
atively distinct. In that cases the interest is typically in average effects of the form r(tc, w'). 
In other cases, however, the treatment levels may measure the quantity of a dose. In such 
cases the researcher may be interested in weighted combinations of average effects. For ex¬ 
ample, one might be interested in \wt{w,w + 1), with the weights adding up to one, 

which would correspond to a weighted average of unit increases in the dose. One advantage 
of such estimands is that their variance may be lower than that for particular t{w,w'). In 
this case there may also be particular interest in r(l,T), the effect of the maximum dose. 
Our results also apply to all such estimands. 

3 Weak and Strong Unconfoundedness and the General¬ 
ized Propensity Score 

Our focus is on observational studies where assignment to treatment is not completely ran¬ 
dom. Instead, following a large strand of the observational studies literature, we assume that 
assignment to treatment is unconfounded so that, within subpopulations that are homogen¬ 
ous in observed pre-treatment variables, assignment to treatment is as good as random. This 
is strictly weaker than complete randomization by allowing for general associations between 
the treatment level and the pre-treatment variables. 

3.1 The generalized propensity score 

In this section we discuss the generalization of the notion of the propensity score, introduced 
in the causality literature by RR83 for the binary treatment case, to our setting with multi¬ 
level treatments. In the binary treatment case RR83 dehnes the propensity score as the 
conditional probability of receiving the active treatment rather than the control treatment, 
p{x) = pr{Wi = 1 \ Xi = x). Here we generalize that to the multi-level treatment case. 
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following Imbens (2000); 


Definition 1 (Generalized Propensity Score) The generalized propensity score is the condi¬ 
tional probability of receiving each treatment level: 


p{w I x) = pvriWi = w \ Xi = x). 


3.2 Overlap 

Before formally discussing unconfoundedness assumptions, let us assume that there is overlap 
in the covariate distributions: 

Assumption 1 (Overlap) For all values of x the probability of receiving any level of the 
treatment is positive: 

p{w I x) > 0 for all w, x. 

Without this assumption there will be values of x for which we cannot estimate the average 
effect of some treatments relative to others without relying on extrapolation. In Section 6 
we discuss methods for constructing a subsample with better overlap for cases where this 
assumption is (close to) violated. 

3.3 Strong unconfoundedness 

We start by generalizing the conventional RR83 version of the unconfoundedness assumption 
to the case with multi-level treatments. We refer to this as strong unconfoundedness to 
distinguish it from the weaker condition of weak unconfoundedness. 

Definition 2 (Strong Unconfoundedness) The assignment mechanism is strongly uncon¬ 
founded if 

w ^ (r.(1),...,R.(T)) I W. 
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Here we use the _LL notation introduced by Dawid (1979) to denote (conditional) independ¬ 
ence. 

The assumption of strong unconfoundedness has no testable implications. In a particular 
application the assumption is a substantive one, and often a controversial one. Often it can 
be made more plausible by collecting detailed information at baseline on characteristics of 
the units that are related to treatment and outcome. As a result the dimension of Xj may 
be high. 

One implication of strong unconfoundedness is the following extension of the propensity 
score result in RR83 to the multi-level treatment case: 

Lemma 1 (RR83) Suppose the assignment mechanism is strongly unconfounded. Then 


w,- n 


(l7(l),...,17(T))|(p(l|X,),...,p(T 


11X, 


Because J2w=iP('^i^) — follows that p{T\x) is a linear combination of p{l\x),... ,p(T — 
l|a;), and so we do not need to include p{T\x) in the conditioning set. If there are two levels 
of the treatment, the result in the lemma reduces to the result in RR83. As pointed out 
in Imbens (2000) and Rassen et al. (2013), the dimension reduction property of the lemma 
depends on the number of distinct levels for the treatment, and therefore the result is less 
useful in settings with a substantial number of treatment levels. The problem is that without 
additional assumptions there is in general no scalar function b{x) of the covariates such that 
Wi _LL (Yi{l),... ,Yi(T)) I b{Xi), suggesting that the advantages of the propensity score 
approach do not carry over to the multi-level treatment case. Joffe & Rosenbaum (1999); 
Lu et al. (2001); Imai & Van Dyk (2004); Zanutto, Lu, & Hornik (2005) discuss additional 
assumptions under which functions b{-) exist with this property and whose dimension is 
lower than T — 1. In particular, Lu et al. (2001) assume that a scalar balancing function b{-) 
exists and propose a matching estimator based on b{-), and Zanutto, Lu, & Hornik (2005) 
propose a subclassihcation estimator under this assumption. Nevertheless, in general such 
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functional form assumptions may be controversial. 


3.4 Weak unconfoundedness 

We improve the dimension reduction property of the generalized propensity score by weaken¬ 
ing the requirement of strong unconfoundedness condition to weak unconfoundedness. Define 
the T indicator variables Di{w) G {0,1}; 

1 iiW, = w, 

Di{w) = < 

0 otherwise. 

In terms of these indicator variables strong unconfoundedness is equivalent to 
(a(1),...,A(T-1)) X (f,(1),...,F*(T)) I A. 

Now we can formulate the weak unconfoundedness notion, introduced in Imbens (2000). 

Definition 3 (Weak unconfoundedness) The assignment mechanism is weakly unconfounded 
if for all w G W, 

Di{w) _LL Yi{w) I Xi. 

Although formally it is obviously weaker, we do not wish to argue that weak unconfoun¬ 
dedness is substantively weaker than strong unconfoundedness. In fact neither have testable 
implications, and there appear to be no interesting estimands that are identihed under the 
stronger assumption but not under the weaker assumption. Rather, the two key insights, and 
the motivation for distinguishing between the two unconfoundedness assumptions, are, one, 
that, as shown in Lemma 2 below, weak unconfoundedness is preserved if we condition on a 
scalar function of the pretreatment variables, whereas preserving strong unconfoundedness 
requires conditioning on a set of T — 1 functions of the pre-treatment variables, as shown in 
Lemma 1, and two, that weak unconfoundedness is sufficient for identifying average causal 



effects, as formalized in Lemma 3 below. 


Lemma 2 (Weak Unconfoundedness) Suppose the assignment mechanism is weakly uncon¬ 
founded. Then for all w G W, 


Diiw) _LL Yi{w) I p{w\Xi). 


Lemma 3 (Average Causal Effects Under Weak Unconfoundedness) Suppose the assignment 
mechanism is weakly unconfounded. Then 


E[Fi(w;') - = E | Wi = w',p(w' | X,)] - E | Wi = w,p(w | X^)] 


Lemma 3 is the key result. For its interpretation it is useful to compare it to the standard 
result under strong unconfoundedness. Under the strong unconfoundedness assumption we 
create subpopulations where we can simultaneously compare units with all different levels of 
the treatment, leading to 

ElV,(w')-V,(w)] 


= E 


I W, = w',p(l I X,),... ,p(T-l|X,)] -E | W, = w,p(l | X,),... ,p(T-l 


= E 


E[W{w') - W{w) I p{l I X,), ...,p{T- 1|X,)] 


To allow for comparisons of all treatments these subpopulations were dehned by common 
values for the full set of T — 1 propensity scores (p(l | Xj),... ,p{T — 1 | X*)). Under weak 
unconfoundedness we do not, and in fact cannot, construct such subpopulations. However, 
in order to estimate the average effect E[Yi{w') — Yifw)] it is not necessary to do so. Instead, 
we construct, for each treatment level w separately, subpopulations where we can estimate 
the average value of the potential outcomes, but only for that single treatment level. For 
treatment level w these subpopulations are dehned by the value of a single score, p{w\Xi), 
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leading to the equality 


E[yiH] =E|E[F.°'’‘^ I Wi 


w,p(w I Xi)' 


That difference in focus allows us to reduce the dimension of the conditioning variable to a 
scalar, irrespective of the number of treatment levels. 


4 Matching 

In this section we discuss matching methods. First we discuss conventional matching on 
the full set of pre-treatment variables. This is not a new method, but it will be useful to 
contrast with the proposed methods. Then we discuss how the generalized propensity score 
can be used to develop a new matching estimator that matches only on a scalar function of 
the pre-treatment variables. 

4.1 Matching 

Frolich (2004b) demonstrates covariate matching in multi-level treatments. Here we focus on 
nearest neighbor matching. Other modihcations include multiple nearest neighbors match¬ 
ing, kernel matching and so forth. Reviews of general matching methods can be found in 
Imbens (2004); Imbens & Rubin (2015); Huber, Lechner & Wunsch (2013). Dehne the cov¬ 
ariate matching function rricov : WxXi—)■ {l,...,iV}as the index for the unit with treatment 
level w that is closest to x in terms of covariates (ignoring ties): 

mcoviwjx) = aig min ||Xj—a;||. 

j:Wj=w 

Here we use 11 ■ 11 to denote a generic metric. In practice one would typically use the Mahala- 
bonis metric, where ||a;—a;'|| = {{x—x'Y'V~^{x — x')Y^‘^,wit]iV = X].(Xi —X)(Xj —X)^/iV, 
and X = Note that the set of indices we search over includes all units, includ- 
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ing unit i itself, so that for all i, rricoviWi, Xi) = i. Given the covariate matching function 
mcov{w,x) the potential outcomes for unit i are imputed as 


Y^{w) 


■y^obs 

^ mcov{'w,Xi)i 


ioT w = 1,... ,T. Now we estimate t{w, w') as 


N 




2 = 1 


( 2 ) 


Note that to estimate t{w,w') we impute potential outcomes Yi{w) and Yi{w') even for units 
who did not receive either treatment level w or w'. This ensures comparability of average 
treatment effects for different pairs of treatments. 


4.2 Matching on the Generalized Propensity Score 

Just as in the binary treatment setting, matching on all covariates is not an attractive 
procedure in the multi-level treatment setting if the number of covariates is substantial 
{e.g., Abadie & Imbens, 2006; Imbens & Rubin, 2015; Imai & Van Dyk, 2004). In the binary 
treatment case RR83 proposed matching on the propensity score to reduce the dimensionality 
of the matching problem. If p{l\x) is the Rosenbaum-Rubin propensity score, the matching 
function for the binary treatment case would be 

= arg min |b(l|Xj) -p||. (3) 

^ j:Wj=w 
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One could generalize that to the multi-level treatment case by matching on the full set of 
scores, leading to 

... ,pr-i) = arg min 

j:Wj=w 

Here we generalize this to the case with multi-level treatments in a way that allows 
for a scalar matching variable. In this case matching is conceptually quite different from 
matching on covariates. We separate the estimation of t{w,w') = — E[l^(t(;)] into 

the two terms. First we focus on the problem of estimating E[yj(t(;)]. Dehne the generalized 
propensity score matching function as 

mgps{w,p) = arg min \\p{w\Xj) - p\\. (5) 

j:Wj=w 

Here the treatment level w enters into the matching function not only by limiting the set of 
potential matches to the set of units with Wj = w, but also in the function of the covariates 
that is being matched, p{w\Xj). In covariate and conventional propensity score matching 
the treatment level only affects the set of potential matches. 

Given the generalized propensity score matching function we impute Yi{w) as 

Yiiw) — 






( 4 ) 


The average effect is estimated as 


N 

fgps(w,w') = 

i=\ 


v^obs 

^ ,p{w'\Xi)) 


v^obs 


( 6 ) 


Note that the difference Y^^ , , ,,, v — Y^^ , , n., in (6) is not generally an estimate of 

m^ps{w,p{w\Xi)) V / 'J 

an average causal effect, whereas in (2) the difference Y^^ , . ^ ^ — IT’’® / , ^ n is an estimate of 

O ’ ' ' mc.ov(w,Xi} mcov(w,Xi) 


12 



the average causal effect E[yj(t(;') — Yi[w) \ Xj]. In the binary treatment case this distinction 
between covariate and propensity score matching does not matter; in that case there are 
only two values for w, w = 1, 2, so that p(l | x) = 1 — p{2 \ x), and therefore matching on 
p{l I x) is the same as matching on both p{l \ x) and p{2 \ x), and the same as matching on 
p{2 I x). 

In Web Appendix, we provide mathematical details for inference and show that under 
mild regularity conditions, the matching estimator based on the generalized propensity score 
or the estimated generalized propensity score is asymptotically normally distributed. 

5 Subclassification on the Generalized Propensity Score 

In the binary treatment literature, a common alternative to matching is subclassihcation or 
stratihcation on the propensity score, originally proposed by RR83. To put our proposed 
methods for the multi-level treatment case in perspective, let us briefly summarize their 
approach for the binary treatment case in our current notation to show why it does not 
directly extend to the multivalued treatment case. Divide the sample into a number of 
subclasses by the value of the propensity score p{l\x). Based on Cochran (1968) who shows 
that this removes much of the bias, researchers often use hve subclasses. To be specihc, let 
quintile of the empirical distribution of p(l|Xj), for j = 1,..., 4, and dehne 
_ g = 1. Then we construct the hve subclasses, based on the propensity 

score being between and For subclass j one can estimate the average causal 

effect of treatment 1 versus treatment 2 as 
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where Njy^ is the number of units in subclass j with treatment level w. The overall average 
treatment effect is then estimated by averaging over the subclasses: 


f(l,2) 


5 


E 


Nji + Nj2 

N 




Because all Nji + Nj 2 are close to equal, at most differ by 1, (assuming there are no ties) 
this is essentially a simple arithmetic mean of the J estimates fj(l, 2). 

Now consider the multi-level treatment case. We are interested in t{w,w') for some pair 
of treatment levels w and w'. Again, and this is a cornerstone of our approach, we write this 
as a difference of two expectations, t{w,w') = — E[l^(t(;)] and separately estimate 

the two terms E[yj(t(;')] and E[l^(t(;)]. To estimate the second term, E[yj(t(;)] we construct 
subclasses or strata based on p{w\x). Let be the quintiles of p{w \ Xi) in the sample. 

Then the average value of Yi{w) in subclass j is estimated as 




N. 


JW 


E 


Y 


obs 


r.qf_TYp{w\Xi)<qf'^'^\w,=w 


where Nj^ is the number of units with qPY\^) < p(w I Xi) < ^ overall 

average of Yi{w) is then estimated as 

^ AT 

E[b(w)] 

i=i 

The key is that, in contrast to what is done in the binary treatment case, we do not 
construct subclasses defined by similar values for the T — 1 propensity scores such that we 
can estimate causal effects within the subclasses. Instead we construct subclasses dehned 
by similar values for a single propensity score at a particular treatment level so that we can 
estimate the average potential outcome for that treatment level within the subclasses, and we 
do so separately for each treatment level, with different subclasses for each treatment level. 
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In the binary treatment case this amounts to the same thing because the two propensity 
scores p(l|a;) and p{2\x) are linearly related, but in the multi-level case these two approaches 
are different. 

6 Assessing And Ensuring Overlap 

6.1 Assessing balance 

We focus on assessing balance in the covariate distributions in terms of the propensity score 
as well as directly in terms of the covariates, following the discussion in Imbens & Rubin 
(2015) for the binary case. For each treatment level w, we calculate the average values for 
each component of the covariate vectors and their corresponding sample variances: 

X^ = N-^ Y, X,, ^ndSY = {N^-l)~^ Y {X,-X^f. 

i:Wi=w i:Wi=w 

Dehne also for each treatment level the average value of the covariates for units with a 
treatment level different from w and the average variance: 

T 

X^={N-N^)-^ Y Xi, andS^^\w=T-^Y^Xn,- 

iiWi^w i=l 

respectively. The hrst approach to assessing the covariate balance is to inspect the normalized 
differences for each covariate and each treatment level: 

= (X^-X^)/Sx\w (7) 

We can also assess balance by looking at the generalized propensity score. For each treatment 
level we can calculate the normalized difference for the generalized propensity score for that 
treatment level: 

= (^p{w I X)^ - p{w I X)_) / Sp(^x)\w (8) 
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where p{w \ X)^ = E Pi'<^ I E) and p{w \ X)- = {N - N^) ^ E Pi'>^ I E)- 

i:Wi=w i:Wij^w 

Finally, one may wish to plot a histogram of p{w \ Xi) for the units for Wi = w and a 
histogram of p{w \ Xi) for the N — units with Wi ^ w in the same hgure. 

6.2 Improving Overlap 

In many applications there are regions of the covariate space with low values for the probab¬ 
ility of receiving one of the treatments. This is likely in the setting with a binary treatment, 
but even more likely to be an issue in settings with many treatment levels. Of note, lack 
of overlap affects the credibility of all methods attempting to estimate all pairwise average 
causal effects from the common population. In that case we may wish to modify the estim- 
ands to average only over the part of the covariate space with all treatment probabilities away 
from zero. The question is how to choose the set of covariates with overlap. For the binary 
treatment case. Crump et ah (2009) proposed a method for improving overlap by trimming 
the sample. Specihcally they suggest dropping units from the analysis with low and high 
values of the propensity score. The threshold for dropping units is based on minimizing the 
variance of the estimated average treatment effect on the trimmed sample. By trimming 
the sample, this method generally alters the estimand to the so-called feasible estimand, by 
changing the reference population. Using the feasible estimand is widely recommended in 
the literature, as long as we are careful to characterize the resulting quantity of interest. 
Here we generalize the Crump et ah (2009) approach to the multi-level treatment case. We 
focus on average treatment effects for subsets of the covariate space. Formally, define the 
conditional average treatment effect: 

t{w,w'\C) = E[Yi{w') - Yi{w) I Xi e q. 

The semiparametric efficiency bound for t{w, w' \ C) is, building on the work by Hahn (1998) 
and Hirano, Imbens, and Ridder (2003), under homoskedasticity and constant treatment 
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effects, 


Y{w,w'\C) = 


a 


-E 


1 1 
+ 


Xi e C 


pr{Xi e C) [p{w\Xi) p{w'\Xi) 

In the binary case Crump et al. (2009) proposed choosing C to minimize Y{w, tc' | C), which 
leads to dropping units with p{l \ X*) < a and units with p{l \ X*) >1 — 0 ;, with a an 
estimable function of the marginal distribution of the propensity score. 

For the multi-level treatment case we suggest focusing on the subset of the covariate 
space C that minimizes 


_ 2rr2 


E 


E 


^^p{w 


X,- 


X e C 


Under homoskedasticity and a constant treatment effect this will lead to a set C of the form 


= <^ X e X 


E 


^ Piw I Xi) 


<A , 


where the threshold A satishes 


A < 


pr 


-E 


■ T 

E 

W 


■^^p{w 


Xi 


T 

E 

W 


■^^p{w 


X, 


< A 


To implement the trimming method in practice in the multi-level treatment case we replace 
the expectation by an average and then hnd the largest A that satishes the inequality. 


7 A Simulation Study 

In this section we assess the performance of the two new estimators in cases of multi-level 
treatments (matching on the generalized propensity score, GPSM, and subclassihcation on 
the generalized propensity score, GPSS) in a Monte Garlo study relative to hve previously 
proposed estimators, hrst the simple difference in average outcomes (DIF) by treatment 
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status, second pairwise propensity score matching (PPSM) that compares two treatment 
levels at a time using the binary propensity score matching on the units exposed to one 
of those two treatment levels, third the estimator based on matching on the set of T — 1 
propensity score set (PSSM), fourth the estimator based on weighting, and hfth, matching 
on all covariates (COV). In the binary treatment and missing data case previous simulations 
have found that weighting estimators can have high variability, e.g., Kang & Schafer (2007) 
and Guo & Fraser (2010), especially if the probabilities are close to zero. Frolich (2004a) 
found that the weighting estimator was inferior to pairwise matching estimators in terms of 
root mean squared error. This is even more likely to be a concern in settings with multiple 
treatment levels than in the binary treatment case because, with the probabilities for the 
T treatment levels adding up to one, with T large some probabilities are likely to be close 
to zero. Because in the binary treatment case it has been found that matching on high¬ 
dimensional covariates is not practical for commonly found sample sizes {e.g., the theoretical 
results in Abadie & Imbens, 2006), it is likely that in settings with many treatment levels 
matching on all scores is not effective either. These results motivate us to compare the 
seven estimators in settings with a large number of treatment levels, and where some of the 
treatment levels have low probability for some covariate values. In the simulations we focus 
on two designs, one with three treatment levels and one with six treatment levels, and both 
with six covariates. 

In the hrst design with three treatment levels the covariates Xii,X 2 i, and X^i are mul¬ 
tivariate normal with means zero, variances of (2,1,1) and covariances of (1, —1, —0.5); X 4 i ~ 
f/[- 3 , 3 ];X 5 i ~ Xi', andXgi ~ Bernoulh(0.5), with Xf = (1, Xi*, Xa*, Xg*, ^ 4 ^, Xg*). The 

three treatment groups are formed using multinomial regression model 

(A(l), A(2), A(3)) ~ Multinom(p(l|Xi),p(2|Xi),p(3|Xi)), 
where Di{w) is the treatment indicator, i.e. Di{w) = 1, if the unit i belongs to treat- 
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ment w, and p{w\Xi) = exp{Xf /3y,)/ exp(Xf/3^/), /3j' = (0, 0, 0,0, 0, 0, 0), = 0.7 x 

(0,1,1,1, —1,1,1), and = 0.4x (0,1,1,1,1,1,1). The outcome design is Yi{w) = Xf^^+rii 
with rj, ~ iV(0,1), 7 f = (-1.5,1,1,1,1,1,1), 72 ^ = (-3, 2, 3,1, 2, 2, 2), and 73 ^ = (1.5, 3,1, 2, 
— 1, —1, —1). The sample sizes are N^, = 500, for tc = 1, 2,3. 

We compare the seven estimators over 1000 datasets. The generalized propensity scores 
are estimated using multinomial logistic regression model with all covariates entering the 
model linearly. 95% conhdence intervals for point estimates were calculated using: (a) 2.5 
and 97.5 percentiles from 1000 bootstrap samples for DIF, GPSS, and weighting; (b) point 
estimate ±1.96 x (variance) for Abadie & Imbens (2006) variance estimator for COV 
and PSSM; and for Abadie & Imbens (2012) variance estimator for PPSM and GPSM, 
which takes into account the uncertainty of the matching procedure and the uncertainty of 
estimating generalized propensity scores, as in Web Appendix. 

Table 1 presents the bias, root mean squared error (RMSE) and coverage of 95% conhd¬ 
ence intervals. DIF shows that there are substantial biases associated with the covariates. 
PPSM compares two treatment levels at a time using the units exposed to one of those 
two treatment levels, which focuses on different populations of inference each time. This 
leads to inconsistency for simultaneous comparison of treatment levels. One implication is 
that f(l, 2) ± f(2, 3) ± f(3,1) % 0. Even with only three treatment levels, and so only two 
propensity scores to match on, PSSM did not control the bias well. The four remaining pro¬ 
cedures, including GOV, GPSM and GPSS, and weighting, do a fairly good job of reducing 
the bias for all average treatment effects. Among these four, GOV has smallest RMSE. For 
inference, asymptotic 95% conhdence intervals provide coverage very close to the nominal 
coverage for GPSM, which conhrms our inference theory in Web Appendix. Asymptotic 95% 
conhdence intervals for GPSS and weighting are also fairly accurate, but GOV leads to un¬ 
dercoverage, consistent with the results in Abadie & Imbens (2006) on the bias of matching 
estimators with multiple covariates. 

In the second design with six treatment levels, we consider propensity score design as 
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p{w I Xi) = expiXf /3^)/ /3y,>), where = (0,0, 0, 0, 0, 0, 0), /3j = 0.4 x 

(0,1,1,2,1,1,1), /3J = 0.6 X (0,1,1,1,1,1,-5), /3J = 0.8 x (0,1,1,1,1,1,5), = 1.0 x 

(0,1,1,1, —2,1,1), and /3j = 1.2 x (0,1,1,1, —2, —1,1). The outcome design is Yi{w) = 
Xf 7 ^, with 7 f = (-1.5,1,1,1,1,1,1), 72 ^ = (-3,2, 3,1, 2, 2, 2), 73 ^ = (3, 3,1, 2, -1, -1, -4), 
71 = (2.5,4,1, 2, -1, -1, -3), 75 ^ = (2, 5,1, 2, -1, -1, -2), and = (1.5, 6 ,1, 2, -1, -1, -1) 
with r]i ~ N{0, 1). The sample sizes are = 1000, for all w. 

In Figure 1 we present the results for the estimators for the hfteen average treatment 
effects. The simulation setup creates six treatment groups with strong separation in covariate 
distributions, which makes it fundamentally difficulty removing all biases in estimating hfteen 
treatment effects simultaneously. Overall GPSM outperforms the other methods in terms of 
bias and coverage rates, with the coverage rate for nominal 95% conhdence intervals never 
going below 0.75. To assess the performance of the weighting estimator it is useful to look at 
the weights that underly the estimator. Normalizing the weights so that they average to i\% 
for each of the treatment levels, the maximum weight for units in each of the treatment levels 
is 16.9 (treatment level one), 21.2 (treatment level two), 50.0 (treatment level three), 64.9 
(treatment level four), 21.2 (treatment level hve), and 185.1 (treatment level six). Even in 
the three treatment level case these maximum weights are substantial. There the maximum 
weight for units in each of the treatment levels is 16.5 (treatment level one), 95.8 (treatment 
level two), and 17.9 (treatment level three). 

In an extended simulation (see Web Appendix), we compare the performance of the 
estimators under the combinations of (w/o) trimming and (correct/incorrect) generalized 
propensity score model. When the propensity score model is incorrect, the performance for 
all methods based on the propensity score deteriorates. In particular, the weighting estimator 
shows huge bias and variance and poor coverage for all parameters. GPSM is inferior to GOV 
in terms of bias and variance; however, it presents better coverage for ten parameters out of 
hfteen parameters. This suggests that when covariates are high dimensional, the inference 
for GOV is not satisfactory. After trimming, bias and variance are greatly reduced and 
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Table 1: Simulation Results, Design I. Estimators; (1) DIF, simple difference in outcomes for 
units with different treatment levels; (2)PPSM: pairwise comparison using binary propensity 
score matching; (3) PSSM; matching on the propensity score set; (4) W: weighting estimator; 
(5) COV; matching on all covariates; (6) GPSM: matching on the generalized propensity 
score; (7) GPSS: stratification on the generalized propensity score. Variance estimators: (1) 
bootstrapping variance estimator for DIF, GPSS, and W; (2) Abadie and Imbens (2006) 
variance estimator for GOV matching and PSSM; (3) Abadie and Imbens (2012) variance 
estimator for PPSM and GPSM. 



t(1.2) 

Bias 

r(l,3) 

t(2,3) 

RMSE 

r(l,2)r(l,3) 

r(2,3) 

Goverage 95% GI 
r(l,2)r(l,3)r(2,3) 

DIF 

1.34 

0.57 

-0.77 

1.38 

0.60 

0.83 

0.01 

0.26 

0.41 

PPSM 

-0.6 

-1.1 

-0.8 

0.77 

1.16 

0.91 

0.80 

0.001 

0.74 

PSSM 

0.21 

0.19 

-0.02 

0.27 

0.33 

0.29 

0.90 

0.92 

0.98 

W 

0.05 

0.02 

-0.03 

0.55 

0.43 

0.53 

0.91 

0.97 

0.94 

GOV 

0.29 

0.19 

-0.11 

0.33 

0.22 

0.20 

0.75 

0.88 

0.99 

GPSM 

0.14 

0.04 

-0.10 

0.56 

0.36 

0.61 

0.95 

0.95 

0.95 

GPSS 

0.31 

0.05 

-0.27 

0.53 

0.24 
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0.91 

0.99 

0.94 
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Figure 1: Simulation Results, Design II. 
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coverage is improved for all parameters for GPSM, GPSS and weighting, which suggests 
that trimming can improve the performance of GPS based methods. 

8 An Application 

We re-examined data from the REFLECTIONS (Real-world Examination of Fibromyalgia: 
Longitudinal Evaluation of Costs and Treatments) study. REFLECTIONS was a 12-month 
prospective observational study of patients being treated for hbromyalgia at 58 outpatient 
sites in the US and Puerto Rico. Patients had to be at least 18 years of age and initiating 
a new pharmacologic treatment for hbromyalgia. In keeping with the observation nature 
of the study, inclusion and exclusion criteria were kept to a minimum, no requirements on 
the nature of the hbromyalgia treatment were made, and physicians decisions regarding the 
proper treatment and care of patients were made in the course of normal clinical practice. 
Data from patients were collected at baseline during a standard office visit and at 1, 3,6, 
and 12 months post baseline via a computer assisted telephone interviews. For details on 
the design of REFLECTIONS see Robinson et al. (2012). 

For this example, we focused on the analysis of three hbromyalgia medication cohorts 
(Peng et ah, 2015); patients treated with an opioid (either monotherapy or with other 
medications), patients treated with tramadol but not an opioid, and patients not treated 
with tramadol or an opioid (referred to as the Other cohort). The outcome variable utilized 
here is the total score of Fibromyalgia Impact Questionnaire (FIQ), which is composed of 
items measuring physical functioning, number of days the patient felt well, number of days 
the patient felt unable to work due to EM symptoms, and patient ratings of work difficulty, 
pain intensity, fatigue, morning tiredness, stihness, anxiety, and depression. The total score 
ranges from 0 to 80 with lower scores indicating better outcomes, and research suggests a 
14% reduction (or 7.6 points in this sample) is clinically relevant (Bennett et ah, 2009). The 
objective is to produce causal inference pairwise comparisons between the cohorts all based 
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on the same population (the population represented by the trimmed sample), in order to test 
the study hypothesis that there is no difference in FIQ total score among the three cohorts. 

The generalized propensity scores (3 estimated probabilities for each patient) were com¬ 
puted using a multinomial model with 32 predictors from demographics, baseline clinical 
characteristics, comorbidities, resource use, prior hbromyalgia treatment, and physician in¬ 
formation. 

To help address lack of overlap of the populations, the modihed Crump (Crump et ah, 
2009) algorithm of Section 7 was applied. With the REFLECTIONS data, A = 29.88, and 
thus patients were trimmed if their l/p(w|Wj) value was greater than 43.52. This 

resulted in removal of 363 patients (25% of the sample), with 31(9%) from the Opioid cohort 
(OPI), 17(8%) of the Tramadol cohort (TRA), and 315(34%) from the Other cohort (OTH). 
Thus, the analysis cohort included 1101 patients (308 OPI, 188 TRA, 605 OTH). 

The trimming primarily removed patients in OTH who had high propensities for being in 
OTH (and low propensities for either OPI or TRA) and were under-represented in OPT and 
TRA. Using the trimmed sample, generalized propensity score matching was implemented 
following the steps in Section 5 to produce counterfactual outcomes (imputed FIQ total 
scores) for each patient and cohort. The quality of the matches appeared acceptable, with 
the mean difference in propensity scores for the matched patients ranging from 0.0012 to 
0.0014 across the cohorts and the largest matched pair with a difference of 0.035. 

The unadjusted mean changes(sd) from baseline to endpoint (12 months post baseline) 
for the FIQ total score in the trimmed cohort were —2.4(12.3) for OPI, —3.7(14.0) for 
TRA, and —4.2(13.4) for OTH, indicating small numerical improvement in pain symptoms. 
Table 2 summarizes the comparative analysis of the FIQ total score improvement among 
the 3 cohorts on the trimmed sample using generalized propensity score matching (GPSM) 
and stratihcation (GPSS). As a comparison, results using no bias adjustment (Difference 
in Means), propensity score set matching (PSSM), weighting (W), and covariate matching 
(GOV) are included. Without bias adjustment, no cohort differences reach the level of 
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Table 2; Analysis results: Change from Baseline to Endpoint (12 Months) for the FIQ total 
score. OPI = Opioid Cohort; TRA = Tramadol Cohort; OTH = Other Cohort. TRA-OPI 
indicates the estimated difference in change from baseline scores for the Tramadol Cohort 
minus the same value for the Opioid cohort. Thus, negative values indicate greater reduction 
in symptoms for the first cohort. Analyses are on the Trimmed sample. Confidence Intervals 
were calculated using the same methods as for simulated study above. 



Pairwise Differences Means (95% GI) 

Method 

TRA - OPI 

OTH- OPI 

OTH - TRA 

DIF 

-1.1 (-3.8, 1.1) 

-1.7(-3.5, 0.3) 

-0.6 (-2.6, 2.2) 

PPSM 

-3.9 (-7.2, -0.6) 

-1.4 (-3.4, 0.6) 

-1.8 (-4.2, 0.6) 

PSSM 

-2.5 (-6.6, 1.6) 

-1.9 (-4.1, 0.4) 

0.7 (-3.1, 4.4) 

W 

-0.8 (-5.4, 4.7) 

-0.3 (-3.8, 5.1) 

0.4 (-2.3, 4.0) 

cov 

-1.6 (-4.8, 1.5) 

-1.5 (-3.8, 0.9) 

0.2 (-2.5, 2.8) 

GPSM 

-1.6 (-4.3,1.!) 

-0.9 (-2.8,1.!) 

0.7 (-1.8,3.2) 

GPSS 

-1.6 (-5.5, 1.2) 

-1.2 (-4.1, 0.9) 

0.4 (-2.2, 3.8) 


statistical significance, though OTH demonstrated marginally greater reductions than OPI 
(p = 0.058). Similarly, none of the adjusted differences led to any statistical signihcant 
Endings, indicating similar health condition improvements at 12 months over baseline among 
the three cohorts. Note that by using the same population across all comparisons, pairwise 
difierences across the cohorts using the generalized propensity scoring methods (GPSM, 
GPSS) are consistent, whereas using PPSM this is not true (the PPSM estimates suggest 
that changing the treatment from OPI to TRA leads to an average efiect of -3.9, changing 
the treatment efiect from TRA to OTH leads to an average efiect of -1.8, and changing the 
treatment from OPI to OTH leads to an average efiect of -1.4, which cannot all be true at 
the same time). This illustrates the impact of differing populations can have when using 
PPSM. 


9 Conclusion 


In this paper, we develop new methods for estimating causal treatment effects using obser¬ 
vational data in settings with multiple (more than two) treatment levels. Existing methods 
require additional assumptions assuming the existence of a scalar balance score, so as to 


24 



facilitate matching and subclassification. We show that, contrary to claims in the literature, 
matching and subclassihcation methods using the propensity score generalize naturally to 
the multi-level treatment case. We focus on matching and subclassihcation on the general¬ 
ized propensity score using the notion of weak unconfoundedness, and show that adjusting 
for a scalar function of the covariates can remove all biases associated with differences in 
observed covariates. 

As with other propensity based analyses, this approach depends on correct specihcation 
of propensity score modeling, and does not resolve the potential for bias due to unmeasured 
confounding. An initial simulation study and example demonstrated the potential benehts 
of the proposed approach at reducing bias and providing causal inference comparisons for 
multiple cohorts on a common population. 
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